{{Short description, Data representation for machine learning Most real world data sets consist of data vectors whose individual components are not

statistically independent Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are independent, statistically independent, or stochastically independent if, informally speaking, the occurrence of o ...

. In other words, knowing the value of an element will provide information about the value of elements in the data vector. When this occurs, it can be desirable to create a factorial code of the data, i. e., a new vector-valued representation of each data vector such that it gets uniquely encoded by the resulting code vector (loss-free coding), but the code components are statistically independent. Later

supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...

usually works much better when the raw input data is first translated into such a factorial code. For example, suppose the final goal is to classify images with highly redundant pixels. A

naive Bayes classifier In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Baye ...

will assume the pixels are

random variables A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...

and therefore fail to produce good results. If the data are first encoded in a factorial way, however, then the naive Bayes classifier will achieve its

optimal Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...

performance (compare Schmidhuber et al. 1996). To create factorial codes,

Horace Barlow Horace Basil Barlow FRS (8 December 1921 – 5 July 2020) was a British vision scientist. Life Barlow was the son of the civil servant Sir Alan Barlow and his wife Lady Nora (granddaughter of the naturalist Charles Darwin). He was educated ...

and co-workers suggested to minimize the sum of the

bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...

entropies of the code components of

binary Binary may refer to: Science and technology Mathematics * Binary number, a representation of numbers using only two digits (0 and 1) * Binary function, a function that takes two arguments * Binary operation, a mathematical operation that ta ...

codes (1989).

Jürgen Schmidhuber Jürgen Schmidhuber (born 17 January 1963) is a German computer scientist most noted for his work in the field of artificial intelligence, deep learning and artificial neural networks. He is a co-director of the Dalle Molle Institute for Artif ...

(1992) re-formulated the problem in terms of predictors and binary

feature Feature may refer to: Computing * Feature (CAD), could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (software design) is an intentional distinguishing characteristic of a software item ...

detectors A sensor is a device that produces an output signal for the purpose of sensing a physical phenomenon. In the broadest definition, a sensor is a device, module, machine, or subsystem that detects events or changes in its environment and sends ...

, each receiving the raw data as an input. For each detector there is a predictor that sees the other detectors and learns to predict the output of its own detector in response to the various input vectors or images. But each detector uses a

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

algorithm to become as unpredictable as possible. The

global optimum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given ran ...

of this

objective function In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cos ...

corresponds to a factorial code represented in a distributed fashion across the outputs of the feature detectors. Painsky, Rosset and Feder (2016, 2017) further studied this problem in the context of

independent component analysis In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that at most one subcomponent is Gaussian and that the subcomponents ar ...

over finite alphabet sizes. Through a series of theorems they show that the factorial coding problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems. In addition, they introduce a simple transformation (namely, order permutation) which provides a greedy yet very effective approximation of the optimal solution. Practically, they show that with a careful implementation, the favorable properties of the order permutation may be achieved in an asymptotically optimal computational complexity. Importantly, they provide theoretical guarantees, showing that while not every random vector can be efficiently decomposed into independent components, the majority of vectors do decompose very well (that is, with a small constant cost), as the dimension increases. In addition, they demonstrate the use of factorial codes to data compression in multiple setups (2017).

References

, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1:412-423, 1989. *

. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992 * J. Schmidhuber and M. Eldracher and B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996 * A. Painsky, S. Rosset and M. Feder. Generalized independent component analysis over finite alphabets. IEEE Transactions on Information Theory, 62(2):1038-1053, 2016 * A. Painsky, S. Rosset and M. Feder. Large Alphabet Source Coding using Independent Component Analysis. IEEE Transactions on Information Theory, 63(10):6514 - 6529, 2017 Independence (probability theory) Signal processing

See also

References